feat[Go]: implement chunk REST APIs (list, add, update, switch)#15670
feat[Go]: implement chunk REST APIs (list, add, update, switch)#15670hunnyboy1217 wants to merge 1 commit into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds dataset/document-scoped chunk REST endpoints (list with pagination/filters, add with deterministic IDs + embeddings, partial update with conditional re-embed, and bulk availability toggle), router wiring, access/embedding helpers, and response normalization. ChangesChunk REST API Implementation
Sequence Diagram — AddChunk flowsequenceDiagram
participant Client
participant ChunkHandler
participant ChunkService
participant DocEngine
participant EmbeddingService
Client->>ChunkHandler: POST /datasets/:ds/documents/:doc/chunks {AddChunkRequest}
ChunkHandler->>ChunkService: AddChunk(datasetID, documentID, userID, req)
ChunkService->>DocEngine: Verify document belongs to KB / insert chunk
ChunkService->>EmbeddingService: embedTexts([doc.Name, embedInput])
EmbeddingService-->>ChunkService: vector
ChunkService->>DocEngine: store chunk + q_<len>_vec
ChunkService-->>ChunkHandler: {"chunk": {...}}
ChunkHandler-->>Client: HTTP 200 {code,data,message}
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
internal/service/chunk.go (1)
1360-1361:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSame panic risk with
tokenizer.Tokenizeresult cast.This has the same issue as in
AddChunk- direct type assertion without safety check.🛡️ Proposed fix
// Tokenize. contentLtks, _ := tokenizer.Tokenize(content) - contentSmLtks, _ := tokenizer.FineGrainedTokenize(contentLtks.(string)) + contentLtksStr, ok := contentLtks.(string) + if !ok { + contentLtksStr = "" + } + contentSmLtks, _ := tokenizer.FineGrainedTokenize(contentLtksStr)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/service/chunk.go` around lines 1360 - 1361, The code unsafely type-asserts tokenizer.Tokenize(...) result to string (contentLtks.(string)) which can panic; change the call sites (where contentLtks and contentSmLtks are set) to use the two-value type assertion or a type switch to verify the returned value is a string before using it, and propagate or handle errors from tokenizer.Tokenize and tokenizer.FineGrainedTokenize instead of ignoring them; update the logic around variables contentLtks and contentSmLtks and the calls to tokenizer.Tokenize and tokenizer.FineGrainedTokenize so you return/log an error or bail out if the result type is unexpected.
🧹 Nitpick comments (1)
internal/handler/chunk.go (1)
367-381: 💤 Low valueConsider using
strconv.Atoiinstead ofjson.Numberfor query parameter parsing.Using
json.Number(v).Int64()for string-to-int conversion is unconventional.strconv.Atoiorstrconv.ParseIntis more idiomatic for parsing query parameters and avoids the intermediatejson.Numbertype.♻️ Suggested refactor
page := 1 if v := c.Query("page"); v != "" { - if p, err := json.Number(v).Int64(); err == nil && p > 0 { + if p, err := strconv.Atoi(v); err == nil && p > 0 { page = int(p) } } pageSize := 30 if v := c.Query("page_size"); v != "" { - if ps, err := json.Number(v).Int64(); err == nil && ps > 0 { + if ps, err := strconv.Atoi(v); err == nil && ps > 0 { if ps > 100 { ps = 100 } - pageSize = int(ps) + pageSize = ps } }This would also require adding
"strconv"to the imports.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/handler/chunk.go` around lines 367 - 381, The current parsing of query params for pagination (page and pageSize variables in internal/handler/chunk.go) uses json.Number(v).Int64(), which is unconventional; replace those conversions with strconv.Atoi (or strconv.ParseInt) to parse c.Query("page") and c.Query("page_size"), keep the same validation (positive values and cap pageSize to 100), and add "strconv" to imports; ensure you still default page=1 and pageSize=30 when parsing fails or values are empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/service/chunk.go`:
- Around line 1431-1459: SwitchChunks currently updates chunks by id without
verifying they belong to documentID; modify SwitchChunks to include the document
ownership check: when building the update condition passed to
s.docEngine.UpdateChunks include "doc_id": documentID (in addition to "id":
chunkID) or first fetch the chunk via s.docEngine (e.g., GetChunk/GetChunks) and
verify chunk.DocID equals documentID before calling UpdateChunks; ensure you
still respect tenantID/indexName/datasetID and keep available_int logic and
error logging (use the same condition/update variable names to locate the
change).
---
Duplicate comments:
In `@internal/service/chunk.go`:
- Around line 1360-1361: The code unsafely type-asserts tokenizer.Tokenize(...)
result to string (contentLtks.(string)) which can panic; change the call sites
(where contentLtks and contentSmLtks are set) to use the two-value type
assertion or a type switch to verify the returned value is a string before using
it, and propagate or handle errors from tokenizer.Tokenize and
tokenizer.FineGrainedTokenize instead of ignoring them; update the logic around
variables contentLtks and contentSmLtks and the calls to tokenizer.Tokenize and
tokenizer.FineGrainedTokenize so you return/log an error or bail out if the
result type is unexpected.
---
Nitpick comments:
In `@internal/handler/chunk.go`:
- Around line 367-381: The current parsing of query params for pagination (page
and pageSize variables in internal/handler/chunk.go) uses
json.Number(v).Int64(), which is unconventional; replace those conversions with
strconv.Atoi (or strconv.ParseInt) to parse c.Query("page") and
c.Query("page_size"), keep the same validation (positive values and cap pageSize
to 100), and add "strconv" to imports; ensure you still default page=1 and
pageSize=30 when parsing fails or values are empty.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 85724050-8e76-4982-be04-53bbaa180426
📒 Files selected for processing (3)
internal/handler/chunk.gointernal/router/router.gointernal/service/chunk.go
|
Hi, @Haruko386 , |
722d990 to
52752d2
Compare
|
Hi @hunnyboy1217 I will review your code soon |
Haruko386
left a comment
There was a problem hiding this comment.
@hunnyboy1217 plz fix CI error first
|
@hunnyboy1217 |
64e7070 to
fef32ae
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
internal/service/chunk.go (1)
1458-1474:⚠️ Potential issue | 🟠 Major | ⚡ Quick winScope the bulk update to the requested document.
documentIDis never used after access resolution, and each update filters only byid. This endpoint is still not document-scoped, so supplied chunk IDs from other documents in the same dataset can be toggled here.🩹 Proposed fix
for _, chunkID := range chunkIDs { - condition := map[string]interface{}{"id": chunkID} + condition := map[string]interface{}{"id": chunkID, "doc_id": documentID} update := map[string]interface{}{"available_int": availInt} if err := s.docEngine.UpdateChunks(ctx, condition, update, indexName, datasetID); err != nil { common.Warn("SwitchChunks: failed to update chunk", zap.String("chunkID", chunkID), zap.Error(err)) } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/service/chunk.go` around lines 1458 - 1474, The update loop currently uses condition := {"id": chunkID} so it ignores the resolved document scope (documentID) and can flip chunks from other documents; change the condition used in s.docEngine.UpdateChunks to include the document identifier (e.g., add "document_id": documentID or the appropriate field name) so the bulk update in the loop (function using resolveDatasetAccess, variables chunkIDs, available, indexName) is scoped to the requested document, leaving indexName and datasetID usage unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/service/chunk.go`:
- Around line 1394-1438: When re-embedding (in the block using
getEmbeddingModelForKB, embedTexts, weightedVec) preserve existing persisted
questions if req.Questions is nil: read d["question_kwd"] (or the stored
question list) into the local questions variable when req.Questions == nil so
embedInput uses the persisted questions rather than empty content; if
req.Questions is provided keep current behavior and update d["question_kwd"] as
you already do. Ensure questions is populated before building embedInput and
before computing the weighted vector so vectors stay consistent with persisted
question_kwd.
- Around line 1285-1289: The code currently swallows errors from
s.documentDAO.UpdateByID after a successful chunk insert; capture and handle the
UpdateByID error instead of ignoring it — either wrap the chunk creation and the
UpdateByID call in a single DB transaction and rollback on any error, or check
the returned error from s.documentDAO.UpdateByID and return it (or propagate a
wrapped error) so the API reports failure; ensure you use the existing chunk
insert call and s.documentDAO.UpdateByID symbols and do not leave the counter
update silent on failure.
- Around line 1124-1136: The REST payload builds the result map using write-side
field names (e.g. "content_with_weight", "important_kwd", "question_kwd",
"docnm_kwd") which causes normalized search-hit data to be dropped; update the
map construction around the result variable so it reads the normalized
search-hit keys used elsewhere in this file's List/Get paths (e.g. "content",
"important_keywords", "questions", "docnm") while still using the existing
helpers (orSlice, orStr, intToBool) and datasetID/image_id handling; locate the
result map creation in chunk.go and replace the write-side keys with the
normalized names so responses include existing chunk values.
- Around line 1000-1022: getEmbeddingModelForKB currently errors when no
KB-specific embedding is set; instead reuse the existing resolveEmbeddingModel
fallback: when embdID is empty after the KB lookup in getEmbeddingModelForKB,
call resolveEmbeddingModel(tenantID) (or the service method that implements that
tenant-default resolution) to obtain the tenant default embdID/model, handle its
error, and then call modelProviderSvc.GetEmbeddingModel(tenantID, embdID).
Update error paths to preserve original error context from resolveEmbeddingModel
and remove the unconditional "no embedding model configured" error in favor of
the resolved result.
---
Duplicate comments:
In `@internal/service/chunk.go`:
- Around line 1458-1474: The update loop currently uses condition := {"id":
chunkID} so it ignores the resolved document scope (documentID) and can flip
chunks from other documents; change the condition used in
s.docEngine.UpdateChunks to include the document identifier (e.g., add
"document_id": documentID or the appropriate field name) so the bulk update in
the loop (function using resolveDatasetAccess, variables chunkIDs, available,
indexName) is scoped to the requested document, leaving indexName and datasetID
usage unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ece3b9fa-5f66-4d97-99e7-9c47e62197b9
📒 Files selected for processing (3)
internal/handler/chunk.gointernal/router/router.gointernal/service/chunk.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/handler/chunk.go
| func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) { | ||
| tenantLLMDAO := dao.NewTenantLLMDAO() | ||
| modelProviderSvc := NewModelProviderService() | ||
|
|
||
| var embdID string | ||
| var err error | ||
| if kb.TenantEmbdID != nil && *kb.TenantEmbdID > 0 { | ||
| _, embdID, err = dao.LookupTenantLLMByID(tenantLLMDAO, *kb.TenantEmbdID) | ||
| } else if kb.EmbdID != "" { | ||
| parts := strings.Split(kb.EmbdID, "@") | ||
| if len(parts) == 2 && parts[1] != "" { | ||
| _, embdID, err = dao.LookupTenantLLMByFactory(tenantLLMDAO, tenantID, parts[1], parts[0], entity.ModelTypeEmbedding) | ||
| } else { | ||
| _, embdID, err = dao.LookupTenantLLMByName(tenantLLMDAO, tenantID, kb.EmbdID, entity.ModelTypeEmbedding) | ||
| } | ||
| } | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to resolve embedding model: %w", err) | ||
| } | ||
| if embdID == "" { | ||
| return nil, fmt.Errorf("no embedding model configured for dataset") | ||
| } | ||
| return modelProviderSvc.GetEmbeddingModel(tenantID, embdID) |
There was a problem hiding this comment.
Reuse the existing embedding-model fallback here.
resolveEmbeddingModel already falls back to the tenant default when a KB has no explicit embedding model, but getEmbeddingModelForKB returns an error instead. That makes AddChunk and UpdateChunkREST fail for datasets that still rely on the tenant default embedding configuration.
🩹 Proposed fix
func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) {
- tenantLLMDAO := dao.NewTenantLLMDAO()
- modelProviderSvc := NewModelProviderService()
-
- var embdID string
- var err error
- if kb.TenantEmbdID != nil && *kb.TenantEmbdID > 0 {
- _, embdID, err = dao.LookupTenantLLMByID(tenantLLMDAO, *kb.TenantEmbdID)
- } else if kb.EmbdID != "" {
- parts := strings.Split(kb.EmbdID, "@")
- if len(parts) == 2 && parts[1] != "" {
- _, embdID, err = dao.LookupTenantLLMByFactory(tenantLLMDAO, tenantID, parts[1], parts[0], entity.ModelTypeEmbedding)
- } else {
- _, embdID, err = dao.LookupTenantLLMByName(tenantLLMDAO, tenantID, kb.EmbdID, entity.ModelTypeEmbedding)
- }
- }
- if err != nil {
- return nil, fmt.Errorf("failed to resolve embedding model: %w", err)
- }
- if embdID == "" {
- return nil, fmt.Errorf("no embedding model configured for dataset")
- }
- return modelProviderSvc.GetEmbeddingModel(tenantID, embdID)
+ return s.resolveEmbeddingModel(tenantID, kb)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) { | |
| tenantLLMDAO := dao.NewTenantLLMDAO() | |
| modelProviderSvc := NewModelProviderService() | |
| var embdID string | |
| var err error | |
| if kb.TenantEmbdID != nil && *kb.TenantEmbdID > 0 { | |
| _, embdID, err = dao.LookupTenantLLMByID(tenantLLMDAO, *kb.TenantEmbdID) | |
| } else if kb.EmbdID != "" { | |
| parts := strings.Split(kb.EmbdID, "@") | |
| if len(parts) == 2 && parts[1] != "" { | |
| _, embdID, err = dao.LookupTenantLLMByFactory(tenantLLMDAO, tenantID, parts[1], parts[0], entity.ModelTypeEmbedding) | |
| } else { | |
| _, embdID, err = dao.LookupTenantLLMByName(tenantLLMDAO, tenantID, kb.EmbdID, entity.ModelTypeEmbedding) | |
| } | |
| } | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to resolve embedding model: %w", err) | |
| } | |
| if embdID == "" { | |
| return nil, fmt.Errorf("no embedding model configured for dataset") | |
| } | |
| return modelProviderSvc.GetEmbeddingModel(tenantID, embdID) | |
| func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) { | |
| return s.resolveEmbeddingModel(tenantID, kb) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/service/chunk.go` around lines 1000 - 1022, getEmbeddingModelForKB
currently errors when no KB-specific embedding is set; instead reuse the
existing resolveEmbeddingModel fallback: when embdID is empty after the KB
lookup in getEmbeddingModelForKB, call resolveEmbeddingModel(tenantID) (or the
service method that implements that tenant-default resolution) to obtain the
tenant default embdID/model, handle its error, and then call
modelProviderSvc.GetEmbeddingModel(tenantID, embdID). Update error paths to
preserve original error context from resolveEmbeddingModel and remove the
unconditional "no embedding model configured" error in favor of the resolved
result.
| result := map[string]interface{}{ | ||
| "id": chunk["id"], | ||
| "content": chunk["content_with_weight"], | ||
| "document_id": chunk["doc_id"], | ||
| "docnm_kwd": chunk["docnm_kwd"], | ||
| "important_keywords": orSlice(chunk["important_kwd"]), | ||
| "questions": orSlice(chunk["question_kwd"]), | ||
| "tag_kwd": orSlice(chunk["tag_kwd"]), | ||
| "dataset_id": datasetID, | ||
| "image_id": orStr(chunk["img_id"]), | ||
| "available": intToBool(chunk["available_int"]), | ||
| "positions": orSlice(chunk["position_int"]), | ||
| } |
There was a problem hiding this comment.
Normalize search hits before building the REST payload.
This code reads write-side field names (content_with_weight, important_kwd, question_kwd, docnm_kwd), but the existing List/Get paths in this file normalize raw search-hit fields like content, important_keywords, questions, and docnm. As written, REST list responses will drop those values for existing chunks.
🩹 Proposed fix
result := map[string]interface{}{
"id": chunk["id"],
- "content": chunk["content_with_weight"],
+ "content": utility.FirstNonNil(chunk["content_with_weight"], chunk["content"]),
"document_id": chunk["doc_id"],
- "docnm_kwd": chunk["docnm_kwd"],
- "important_keywords": orSlice(chunk["important_kwd"]),
- "questions": orSlice(chunk["question_kwd"]),
+ "docnm_kwd": utility.FirstNonNil(chunk["docnm_kwd"], chunk["docnm"]),
+ "important_keywords": orSlice(utility.FirstNonNil(chunk["important_kwd"], chunk["important_keywords"])),
+ "questions": orSlice(utility.FirstNonNil(chunk["question_kwd"], chunk["questions"])),
"tag_kwd": orSlice(chunk["tag_kwd"]),
"dataset_id": datasetID,
"image_id": orStr(chunk["img_id"]),
"available": intToBool(chunk["available_int"]),
"positions": orSlice(chunk["position_int"]),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/service/chunk.go` around lines 1124 - 1136, The REST payload builds
the result map using write-side field names (e.g. "content_with_weight",
"important_kwd", "question_kwd", "docnm_kwd") which causes normalized search-hit
data to be dropped; update the map construction around the result variable so it
reads the normalized search-hit keys used elsewhere in this file's List/Get
paths (e.g. "content", "important_keywords", "questions", "docnm") while still
using the existing helpers (orSlice, orStr, intToBool) and datasetID/image_id
handling; locate the result map creation in chunk.go and replace the write-side
keys with the normalized names so responses include existing chunk values.
| // Increment document chunk_num and token_num. | ||
| _ = s.documentDAO.UpdateByID(documentID, map[string]interface{}{ | ||
| "chunk_num": gorm.Expr("chunk_num + 1"), | ||
| "token_num": gorm.Expr("token_num + ?", tokenCount), | ||
| }) |
There was a problem hiding this comment.
Don't swallow the document-counter write failure.
If the chunk insert succeeds but UpdateByID fails, chunk_num and token_num drift from the actual chunk store while the API still returns success. This silently breaks document stats and makes retries/reconciliation ambiguous.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/service/chunk.go` around lines 1285 - 1289, The code currently
swallows errors from s.documentDAO.UpdateByID after a successful chunk insert;
capture and handle the UpdateByID error instead of ignoring it — either wrap the
chunk creation and the UpdateByID call in a single DB transaction and rollback
on any error, or check the returned error from s.documentDAO.UpdateByID and
return it (or propagate a wrapped error) so the API reports failure; ensure you
use the existing chunk insert call and s.documentDAO.UpdateByID symbols and do
not leave the counter update silent on failure.
| questions := []string{} | ||
| if req.Questions != nil { | ||
| for _, q := range req.Questions { | ||
| if q = strings.TrimSpace(q); q != "" { | ||
| questions = append(questions, q) | ||
| } | ||
| } | ||
| d["question_kwd"] = questions | ||
| d["question_tks"] = strings.Join(questions, "\n") | ||
| } | ||
|
|
||
| if req.Available != nil { | ||
| avInt := 0 | ||
| if *req.Available { | ||
| avInt = 1 | ||
| } | ||
| d["available_int"] = avInt | ||
| } | ||
| if req.Positions != nil { | ||
| d["position_int"] = req.Positions | ||
| } | ||
| if req.TagKwd != nil { | ||
| d["tag_kwd"] = req.TagKwd | ||
| } | ||
| if req.TagFeas != nil { | ||
| d["tag_feas"] = req.TagFeas | ||
| } | ||
|
|
||
| // Re-embed when content or questions changed. | ||
| if req.Content != nil || req.Questions != nil { | ||
| em, err := s.getEmbeddingModelForKB(kb, tenantID) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to get embedding model: %w", err) | ||
| } | ||
| embedInput := content | ||
| if len(questions) > 0 { | ||
| embedInput = strings.Join(questions, "\n") | ||
| } | ||
| vecs, _, err := embedTexts(em, []string{doc.Name, embedInput}) | ||
| if err != nil { | ||
| return fmt.Errorf("embedding failed: %w", err) | ||
| } | ||
| if len(vecs) >= 2 { | ||
| vec := weightedVec(vecs[0], vecs[1]) | ||
| d[fmt.Sprintf("q_%d_vec", len(vec))] = vec |
There was a problem hiding this comment.
Preserve existing questions when only content changes.
questions starts empty and is only filled from req.Questions. If a chunk already has questions and the caller patches only content, the re-embed path falls back to raw content while question_kwd remains unchanged in storage. The vector then no longer matches the chunk's persisted question state.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/service/chunk.go` around lines 1394 - 1438, When re-embedding (in
the block using getEmbeddingModelForKB, embedTexts, weightedVec) preserve
existing persisted questions if req.Questions is nil: read d["question_kwd"] (or
the stored question list) into the local questions variable when req.Questions
== nil so embedInput uses the persisted questions rather than empty content; if
req.Questions is provided keep current behavior and update d["question_kwd"] as
you already do. Ensure questions is populated before building embedInput and
before computing the weighted vector so vectors stay consistent with persisted
question_kwd.
660805c to
47e915f
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
internal/service/chunk.go (1)
1480-1488:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSwitchChunks does not validate that chunks belong to the specified document.
The
documentIDparameter is accepted but never used. The update condition at line 1482 contains only"id": chunkID, allowing a user with dataset access to toggle availability of chunks from any document in that dataset—not just the document specified in the URL path.Add
"doc_id": documentIDto the condition map so the update is scoped to the correct document:🔒 Proposed fix
for _, chunkID := range chunkIDs { - condition := map[string]interface{}{"id": chunkID} + condition := map[string]interface{}{"id": chunkID, "doc_id": documentID} update := map[string]interface{}{"available_int": availInt} if err := s.docEngine.UpdateChunks(ctx, condition, update, indexName, datasetID); err != nil { common.Warn("SwitchChunks: failed to update chunk", zap.String("chunkID", chunkID), zap.Error(err)) } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/service/chunk.go` around lines 1480 - 1488, SwitchChunks accepts documentID but never scopes the chunk update to that document, so UpdateChunks is called with condition {"id": chunkID} and can modify chunks from other documents; update the condition map in the loop inside SwitchChunks to include the document ID (e.g., add "doc_id": documentID) so the call to s.docEngine.UpdateChunks(ctx, condition, update, indexName, datasetID) only affects chunks belonging to the specified documentID (ensure the key name matches the DB field used elsewhere).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@internal/service/chunk.go`:
- Around line 1480-1488: SwitchChunks accepts documentID but never scopes the
chunk update to that document, so UpdateChunks is called with condition {"id":
chunkID} and can modify chunks from other documents; update the condition map in
the loop inside SwitchChunks to include the document ID (e.g., add "doc_id":
documentID) so the call to s.docEngine.UpdateChunks(ctx, condition, update,
indexName, datasetID) only affects chunks belonging to the specified documentID
(ensure the key name matches the DB field used elsewhere).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 74ea85a6-36ef-419f-a4ec-ab3d3c1af790
📒 Files selected for processing (3)
internal/handler/chunk.gointernal/router/router.gointernal/service/chunk.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/router/router.go
- internal/handler/chunk.go
c2d8273 to
95d8fa7
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #15670 +/- ##
=======================================
Coverage 93.16% 93.16%
=======================================
Files 10 10
Lines 717 717
Branches 118 118
=======================================
Hits 668 668
Misses 29 29
Partials 20 20 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Hi, @Haruko386, @Hz-186, |
Hz-186
left a comment
There was a problem hiding this comment.
Thanks for the great effort in porting the chunk management APIs to Go!
I've conducted a thorough end-to-end API functional test between the existing Python API (:9380) and the new Go API (:9384).
I found 3 critical bugs that need to be fixed before merging:
1. Go API fails to resolve valid Embedding Models (AddChunk)
When calling POST /api/v1/datasets/.../chunks, the Python API successfully parses and embeds the chunk using the configured model (e.g. Qwen/Qwen3-Embedding-8B@test@SILICONFLOW).
However, the Go API fails with: failed to get embedding model... record not found. The model resolution logic in getEmbeddingModelForKB (or the underlying tenant LLM lookup) seems broken for complex model names/factories, blocking chunk creation entirely.
2. Silent Success on Non-existent Chunk IDs (SwitchChunks)
When sending PATCH /api/v1/datasets/.../chunks with a fake or invalid chunk_id (e.g., ["FAKE_ID"]):
- Python API strictly returns
code: 102(Index updating failure). - Go API returns
code: 0(success).
The Go implementation is missing the validation to check ifupdated_count > 0after hitting the document engine, giving users a false positive success.
3. Severe Schema Mismatch in ListChunksREST
The doc object returned in GET /api/v1/datasets/.../chunks diverges heavily in Go:
- Wrong Keys: Go returns
kb_id,chunk_num, andtoken_num. Python relies ondataset_id,chunk_count, andtoken_count. - Missing Metadata: The Go implementation strips out almost all critical document metadata (e.g.,
parser_config,process_duration,status,size,type). Python includes all of these. If left unfixed, this will cause the frontend console to crash due to missing fields.
Could you please address these logic alignments? Thanks again for your hard work!
Ports four chunk endpoints from Python to Go (infiniflow#15668): GET /api/v1/datasets/:dataset_id/documents/:document_id/chunks (ListChunksREST) POST /api/v1/datasets/:dataset_id/documents/:document_id/chunks (AddChunk) PATCH /api/v1/datasets/:dataset_id/documents/:document_id/chunks/:chunk_id (UpdateChunkREST) PATCH /api/v1/datasets/:dataset_id/documents/:document_id/chunks (SwitchChunks) Addresses review feedback (CodeRabbit) and the CI build break: - tokenizer.Tokenize returns (string, error); the bogus contentLtks.(string) assertion on a non-interface value (a compile error breaking CI) is removed — the string is passed straight to FineGrainedTokenize - ListChunksREST page/page_size now parse with strconv.Atoi instead of json.Number(...).Int64() - embedTexts no longer reports len(d.Embedding) (the fixed vector dimension) as token usage; token count is derived from the input text via the project tokenizer (estimateTokenCount), since the embedding driver exposes no provider token usage Rebased onto latest main, resolving the chunk.go import conflict with upstream's RetrievalTest refactor (now using service/nlp): merged the import union (keeping nlp from upstream and xxhash/gorm from this PR).
95d8fa7 to
c50554f
Compare
|
HI, @Hz-186 , |
What problem does this PR solve?
Closes #15668 — ports four missing chunk REST endpoints from Python (
api/apps/restful_apis/chunk_api.py) to the Go API layer.The following were already in Go:
GET .../chunks/:chunk_idandDELETE .../chunks. This PR adds the remaining four:/api/v1/datasets/:dataset_id/documents/:document_id/chunksListChunksRESTlist_chunks/api/v1/datasets/:dataset_id/documents/:document_id/chunksAddChunkadd_chunk/api/v1/datasets/:dataset_id/documents/:document_id/chunks/:chunk_idUpdateChunkRESTupdate_chunk/api/v1/datasets/:dataset_id/documents/:document_id/chunksSwitchChunksswitch_chunksType of change
Implementation notes
Files changed:
Functional parity table:
resolveDatasetAccessiterates user tenants, callskbDAO.GetByIDAndTenantID— mirrorsKnowledgebaseService.accessibledocumentDAO.GetByID+doc.KbID == datasetID— mirrorsDocumentService.query(id=…, kb_id=…)xxhash.Sum64String(content + documentID)— same as Pythonxxhash.xxh64(…).hexdigest()tokenizer.Tokenize+tokenizer.FineGrainedTokenize(existing Go tokenizer)getEmbeddingModelForKBresolves KB → embedding model (same lookup chain asRetrievalTest);0.1 * docNameVec + 0.9 * contentVecweighted averagegorm.Expr("chunk_num + 1")/gorm.Expr("token_num + ?", count)viadocumentDAO.UpdateByID— mirrorsDocumentService.increment_chunk_numcontentorquestionschanges; all other fields (available, positions, tag_kwd, tag_feas) updated without re-embeddingUpdateChunkswithavailable_intper chunk — mirrors Python'sdocStoreConn.updateloopAddChunkreturns{"chunk": {…}}with Pythonkey_mappingfield renames;ListChunksRESTmirrors_strip_chunk_runtime_fieldsoutput